ABSTRACT


This paper describes a framework for a computer model
of stick figure understanding in which semantic inferences
are made from body postures. Stick figures of humans are
shown to provide a rich environment in which to develop and
test techniques and methods of visual understanding. The
computer system which will implement the theory is called
SKELETUN. The notion of a hierarchical organization of
recognition packets running in parallel is introduced. This
paradigm is applied to an example. In addition, it is shown
how SKELETUN relies on interaction between model-driven and
data-driven modes.
1.  Introduction


The goal of many researchers in artificial intelligence is
to produce systems which exhibit capabilities that are normally
attributed to human beings. If these capabilities are integrated
into a single system, we would hope that the system would be able
to understand and intelligently interact with a complex
environment. The complex environment would consist not only of
objects but also of people. An intelligent system will therefore
have to be able to look at people and understand what they are
doing, i.e., understand their actions and gestures. This
includes perceiving and understanding the actions and gestures of
isolated individuals as well as understanding the actions and
gestures of groups of individuals interacting with each other.
Examples of actions and gestures are running, throwing, pointing,
fighting, and playing football.

What is involved in perceiving and understanding actions and
gestures of people? First, a photograph or television picture of
the people must be digitized and stored in memory. Then,
features such as shapes of body parts, connected regions, and
joint positions must be extracted. Next, these features must be
combined in order to obtain a labeling of the body regions, e.g.,
upper arm, lower arm, torso, etc.

Once the body regions have been labeled, there are three
main types of understanding which can occur. The first type
involves understanding the actions and gestures of isolated
individuals, assuming they are not interacting with other people.
Local body gestures and body postures of the individual are used
to infer what he is doing.

The second type involves understanding the interaction of
two or more people. For example, if a photograph consists of one
person with his hands stretched straight up and a second person
holding a gun which is pointed at the first person, then our
understanding of what the first person is doing (i.e., arms
stretched up) is increased if we take into account the action of
the second person; that is, we hypothesize that a robbery may be
occurring.

The third type of understanding involves taking a sequence
of photographs of people and trying to understand the meaning of
the whole sequence. This sequence may be either snapshots taken
very rapidly one after another (e.g., 1/24 second apart), or
scenes such as those in a comic strip which convey a story.

Implementation of the three types of understanding would be
very difficult, since the technologies available for this purpose
in the areas of computer vision and artificial intelligence are
in their infancies. I have therefore decided to concentrate my
research on understanding the semantics of body postures, body
gestures, and actions, while avoiding the problems of low-level
visual processing involved in extraction of syntactic features
from pictures and labeling body parts in the picture. One way of
avoiding these problems while still maintaining a rich domain of
human gestures and actions is to limit the domain to stick
figures of humans. This eliminates not only the problems of
extracting, identifying, and labeling regions of the human body,
but also the problem of occlusion of one part of the body by
another.

My research will involve developing a theory and system
which will recognize the actions and gestures of stick figures
engaged in activities such as weeping, fighting, pointing,
scratching, etc.
2.  Motivation for the Research


Section 1 points out the desire of the artificial
intelligence community to develop systems which can understand
and interact with a complex environment. It is therefore
important to develop techniques of understanding the gestures and
actions of people, since people are certainly part of many
environments. What are some of the specific applications,
however, of this research? What purpose would the research
serve?

One application lies in the field of robotics. A mobile
robot, such as a robot house-servant, a robot baby-sitter, or a
robot factory worker working among people, may need to have a
relatively sophisticated capability of interacting with people.
My research will develop techniques and methods which may help in
providing this capability at the visual level.

Another purpose of this research is to develop computational
techniques of understanding in general. Understanding systems
are important in the natural language and problem solving domains
as well as in the visual domain. Many of the issues involved in
building understanding systems are common to all of these
domains. I feel that the stick figure world offers a rich
environment in which to test and develop computational techniques
and methods of understanding. In this regard, my intention is
not to develop an ad-hoc system which can understand only a
limited set of stick figures. Rather, I intend to develop ideas
and techniques which are general enough to be applicable to other
domains of understanding. Examples of these domains might be
scenes without people, natural language stories, and abstract
concepts in the sciences and arts.

A final purpose of this research, and one that I feel is
rather important, is to develop models of the cognitive processes
used by the human mind in perceiving and understanding other
humans in the visual world. Certainly, one can argue that any
system constructed to understand stick figures may have
absolutely nothing to do with the way humans understand stick
figures. Furthermore, it is extremely difficult, if not
impossible, to use today's technology to test the validity of any
model of human understanding. How then can anyone claim that a
program running in a computer has anything to do with what goes
on in the human mind?

The nature of cognitive processes is very poorly understood
by psychologists. Since various methods of understanding these
processes have yielded only modest results in the past, I feel
that viewing the human mind as an information processing system
is a valid approach which cannot be easily dismissed. If one
accepts the possibility that this approach might be valid, then
one can begin to comprehend why a computational system which
understands stick figures might provide a means of investigating
human understanding of stick figures. Certainly, I would not
claim that any system which I construct works exactly the way the
human mind works. What I do claim, however, is that the
construction of such systems at least forces us to consider the
details of what might be involved in cognitive processes.
Because of this, the construction of such systems provides
insight into the workings of the human mind which we might not
have been able to obtain in any other way. The theory of
understanding stick figures which I will develop will probably
relate only marginally to the way humans understand stick
figures. Hopefully, however, it will provide insight into the
way the human mind accomplishes this understanding as well as
providing a groundwork upon which better theories can be
developed.
3.  Related Work


The following work has been done in extracting information
about human figures from pictures.

Speckert [S1] has looked at a sequence of pictures of a man
walking as seen from a side view. His program extracts all the
body joint positions (e.g., shoulder, elbow, hip, ankle, etc.)
and labels them. This work complements my research, since it
will aid in devising stick figure representations of humans.

Adler [A1] has built a system which can identify and give
elementary descriptions of characters in Peanuts cartoons. His
recognition process is concerned mainly with describing shapes of
irregular, curved objects and finding hidden contours. Again,
this work complements my research since it is concerned more with
extracting and labeling body parts than with making semantic
inferences.

Tsuji, Morizono, and Kuroda [TMK1] have attempted to analyze
and understand a simple cartoon film. The input to their system
consists of digitized versions of each frame in the film. The
system is able to answer elementary questions about the action in
the film. This work attempts to cover low-level visual
processing of the film plus higher level inferences. Because of
this, the system they have built is elementary in that it can
analyze only one simple film. However, this is one of the first
attempts to analyze and understand a dynamic world.

Some preliminary work in making semantic inferences in a
visual domain has been done by Boose and Rieger [BR1]. They are
concerned with extracting information from the facial expressions
of characters in a children's story book. Facial features are
used to infer qualities such as anger, happiness, and
frustration. My research may be able to use some of the
techniques and mechanisms developed by Boose and Rieger, although
I am concerned primarily with expressions conveyed by the limbs
of the body rather than expressions of the face.


Understanding  Stick  Figures 


Smoliar and Weber [SW1] are implementing a system which will
produce animation of human movement on a graphics display. Their
internal representation of human movement is in the format of
Labanotation, a notation for recording human movement in dance.
I will not be using an internal representation of stick figures
based on Labanotation because I believe it is of no theoretical
interest in the context of general human intelligence.
Labanotation incorporates a detailed, symbolic, 3-dimensional,
time-dependent representation which gives the position of every
part of the body. It would be very difficult, perhaps
impossible, to map an arbitrary 2-dimensional drawing of a stick
figure into this representation, since the line segment lengths
and angles are not accurately measured when drawn. I believe
that a different approach (to be explained later) will prove more
feasible and provide a better theoretical basis.
4.  Research Proposal


My research will involve the construction and implementation
of a theory of stick figure understanding. The computer
implementation of this theory will be referred to as SKELETUN
(SKELETon UNderstanding).

4.1.  Syntax Component

A stick figure will consist of the following body parts: a
circular head, two upper arms, two lower arms, two hands, two
upper legs, two lower legs, two feet, and a torso consisting of
three parts. Examples of stick figures are shown in Displays
1-10. Each part of the stick figure, except the head, will be a
single straight line segment. The points at which the ends of
the line segments meet will be the joints of the stick figure.
Since the torso consists of three parts, it contains two joints.

The input to SKELETUN will consist of the following: (1) the
coordinates of the end points of each line segment of the stick
figure; (2) the points to which each point is connected by a line
segment; (3) the coordinates of the center of the circle
representing the head; and (4) the radius of this circle. This
input will be hand-encoded by a human. Appendix A contains the
LISP hand-encoded input for the figure in Display 1. After
SKELETUN receives the input, it will label all body parts of the
stick figure (i.e., head, upper arm, lower arm, etc.) and
calculate the projected, 2-dimensional angles of all the joints
(i.e., projected elbow joint angle, projected shoulder-upper arm
joint angle, etc.). This initial preprocessing will provide
input to the understanding component of SKELETUN.
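
The preprocessing step can be illustrated with a short sketch. It
is written in Python rather than LISP and is not the encoding of
Appendix A: the point names, the dictionary layout, and the
functions `neighbors` and `projected_joint_angle` are assumptions
made for illustration. It holds the four kinds of input listed
above and computes the projected 2-dimensional angle at one joint.

```python
import math

# Hypothetical hand-encoded input (names and layout are assumptions).
# (1) Coordinates of the end points of the line segments.
points = {
    "shoulder": (0.0, 0.0),
    "elbow": (1.0, 0.0),
    "hand": (1.0, 1.0),
}
# (2) Connectivity: each pair of points is joined by one line segment.
segments = [("shoulder", "elbow"), ("elbow", "hand")]
# (3) and (4) Center and radius of the circle representing the head.
head = {"center": (0.0, 1.5), "radius": 0.5}


def neighbors(name):
    """Return the points connected to `name` by a line segment."""
    found = []
    for a, b in segments:
        if a == name:
            found.append(b)
        elif b == name:
            found.append(a)
    return found


def projected_joint_angle(joint, end_a, end_b):
    """Return the 2-d angle in degrees (0 to 180) at `joint` between
    the segments joint-end_a and joint-end_b."""
    ax, ay = end_a[0] - joint[0], end_a[1] - joint[1]
    bx, by = end_b[0] - joint[0], end_b[1] - joint[1]
    dot = ax * bx + ay * by
    cross = ax * by - ay * bx
    return math.degrees(math.atan2(abs(cross), dot))


first, second = neighbors("elbow")
elbow_angle = projected_joint_angle(
    points["elbow"], points[first], points[second]
)
```

With these coordinates the upper and lower arm are perpendicular
in the drawing, so `elbow_angle` is a right angle. The angle is a
property of the drawing only; whether it matches the real
3-dimensional angle is a question left to the understanding
component.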
4.2.  Understanding Component

The understanding component will utilize the notion of
frames [M1]. A frame is a data structure which represents a
stereotype. It contains constraints which must be satisfied if
an object (or circumstance) is to fall within the stereotype.
Once an object can be seen to "fit" into a certain frame, strong
predictions about properties of the object can be made without
observing these properties directly, since the frame will suggest
certain properties which are typical of its instantiations. The
frame provides a general means for recognition. An additional
property of frames is the following. The frames in a system are
linked together in such a way as to form an information retrieval
network. If a frame is invoked for recognizing a certain object,
but the object fails some of its tests, then the network has the
capability of suggesting other frames which may be invoked.
Examples of frames in SKELETUN are SADNESS, RUBBING-FACE,
SALUTING, LOOKING, SHOWING-OFF-MUSCLES, ARCHED-TORSO, SIDE-VIEW,
TOUCHING, and others to be presented later. (The term "frame"
will soon be abandoned in favor of a more specific term.)
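
The three properties of a frame named above (constraints,
predicted properties, and links to other frames) can be sketched
as a small data structure. This is a Python illustration under
stated assumptions, not SKELETUN's design: the class `Frame`, the
function `recognize`, the field names, and the particular
constraints and default property are all invented here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Frame:
    """A stereotype: tests an object must pass, properties typical of
    its instances, and links to frames worth trying on failure."""
    name: str
    constraints: List[Callable[[Dict], bool]]
    defaults: Dict[str, object] = field(default_factory=dict)
    alternatives: List[str] = field(default_factory=list)

    def fits(self, obj: Dict) -> bool:
        return all(test(obj) for test in self.constraints)


def recognize(start: str, frames: Dict[str, Frame], obj: Dict):
    """Try `start`; if the object fails its tests, follow the links
    to the frames it suggests."""
    pending, tried = [start], set()
    while pending:
        name = pending.pop(0)
        if name in tried:
            continue
        tried.add(name)
        frame = frames[name]
        if frame.fits(obj):
            # Typical properties are predicted, not observed.
            return name, {**frame.defaults, **obj}
        pending.extend(frame.alternatives)
    return None, obj


# Hypothetical constraints, chosen only to exercise the structure.
frames = {
    "SADNESS": Frame(
        "SADNESS",
        [lambda o: o["hands_touch_face"], lambda o: o["torso_arched"]],
        defaults={"mood": "sad"},
        alternatives=["RUBBING-FACE"],
    ),
    "RUBBING-FACE": Frame(
        "RUBBING-FACE", [lambda o: o["hands_touch_face"]]
    ),
}

figure = {"hands_touch_face": True, "torso_arched": False}
result, properties = recognize("SADNESS", frames, figure)
```

Here the figure fails one SADNESS test, so the link to
RUBBING-FACE is followed and that frame is the one returned.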


4.3.  Example 


An example of the application of frames to the understanding
of stick figures is the following. Suppose we input the drawing
of Display 1 to SKELETUN. The initial preprocessing will label
all body parts and determine projected angles of all the joints.
Let us view the output of the preprocessor as data. SKELETUN
will begin in a data-driven mode. At this point, data will be
used to invoke frames. Examples of the kind of data which may be
used for this purpose are: positions of the feet and hands,
positions of the elbows and knees, shape of the torso, and
others. The position of the feet in Display 1 will indicate that
a side view of the figure is being observed, and thus a SIDE-VIEW
frame will be invoked. This frame will determine that the real
angles at which the elbows are bent (i.e., in 3-dimensional
space) are approximately the same as the elbow angles seen in the
2-dimensional drawing. Next, the positions of the hands inside
the circle representing the head will invoke the TOUCHING frame.
This frame contains heuristics to verify the occurrence of any
touching relationships, especially touching relationships
involving the hands and feet. Thus SKELETUN again becomes
frame-driven, or model-driven. (The data- and model-driven
concepts described here are similar to the concepts of active and
passive knowledge described by Freuder [F2].) The TOUCHING frame
has inherited the information from the SIDE-VIEW frame that the
figure is facing the left and that its face is thus probably also
pointing to the left. Since the hands are inside the left part
of the circle representing the head, the TOUCHING frame concludes
that the stick figure's hands are touching the face.
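
The geometric test behind this conclusion can be sketched as
follows. The Python function below is an assumption about how
such a heuristic might look, not the TOUCHING frame itself: the
names `inside_circle` and `hands_touch_face` are hypothetical,
the x axis is assumed to increase to the right, and the rule
that every hand must lie in the facing half of the head circle
is a simplification.

```python
def inside_circle(point, center, radius):
    """True if `point` lies within the circle (boundary included)."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    return dx * dx + dy * dy <= radius * radius


def hands_touch_face(hands, head_center, head_radius, facing):
    """True if every hand lies inside the half of the head circle on
    the side the figure faces ("left" or "right")."""
    for hand in hands:
        if not inside_circle(hand, head_center, head_radius):
            return False
        offset = hand[0] - head_center[0]
        if facing == "left" and offset > 0:
            return False
        if facing == "right" and offset < 0:
            return False
    return True


# The facing direction would be inherited from SIDE-VIEW.
touching = hands_touch_face(
    hands=[(-0.2, 0.0), (-0.1, 0.1)],
    head_center=(0.0, 0.0),
    head_radius=0.5,
    facing="left",
)
```

The same hand positions with `facing="right"` fail the test,
which is why the TOUCHING frame needs the view information
before it can conclude anything about the face.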

The information that the hands are touching the face, with
the elbows in the given position, will invoke a number of
candidate frames, including the SADNESS frame and the
RUBBING-FACE frame. Suppose the SADNESS frame is given control
initially. This frame will have information indicating which
gestures are appropriate when the figure is sad. At this point
SKELETUN will again become model-driven. The SADNESS frame will
contain heuristics such as the following: "One of the things I
should look for is a bent back or arched torso. I will therefore
invoke the ARCHED-TORSO frame and see if it can find an arched
torso." The ARCHED-TORSO frame will now gain control. It will
have access to all data and results known by the SADNESS frame.
The ARCHED-TORSO frame, knowing that the figure is being observed
in a side view, will easily find the arched torso in the figure.
When this feature is found, the ARCHED-TORSO frame will report
success back to the SADNESS frame. If this information is enough
to verify the SADNESS frame, then SKELETUN will "understand" this
figure as being in a state of "sadness."

Suppose that SKELETUN is trying to understand Display 2.
The positions of the feet, arms, and hands in this drawing are
similar to those in Display 1. SKELETUN would therefore again
invoke the SADNESS frame. However, when the frame tries to
verify that the figure has an arched back, the ARCHED-TORSO frame
will instead report that the back is straight. Heuristics in the
control component of SKELETUN (i.e., the information retrieval
network) will indicate that if this occurs, control should be
passed to the RUBBING-FACE frame. This frame will inherit all
the data obtained by the SADNESS frame so that it will not have
to duplicate any work. The frame will then attempt to determine
whether or not the current stick figure "fits in." If this
attempt is successful, SKELETUN will "understand" this stick
figure as being in a state of "rubbing its face." Note that
SKELETUN first tried SADNESS rather than RUBBING-FACE. The
figures in both Displays 1 and 2 are rubbing their faces.
However, saying that the figure in Display 1 is rubbing its face
rather than being sad does not capture the most significant
concept to be interpreted. Similarly, in Display 2, SKELETUN
would have preferred to choose SADNESS. However, since SADNESS
is inappropriate, RUBBING-FACE becomes the most significant
concept.

This example of the use of frames points out one of their
weak points, namely, the serial passing of control from one frame
to another. The system would be more elegant and more general if
it did not need a sequencing component to determine which frame
gains control next. This could be implemented by conceptually
making the system parallel rather than serial in its operation.
The next section will elaborate upon this point.
5.  Organization of the System


As explained above, the notion of frames will be heavily
utilized by SKELETUN. So as not to lock myself into the concept
of a "frame" as defined by Minsky, I will instead call it a
"recognition packet" to emphasize its main function: recognition
of semantically meaningful gestures and actions. The recognition
packets as described here have some similarities to the packets
of facts and demons described by Fahlman [F1].


5.1.  Recognition Packets

SKELETUN will incorporate a hierarchy of recognition
packets. The first version of SKELETUN (which will understand
isolated stick figures) will consist of primary and secondary
recognition packets. Recognition by the primary packets will
constitute the main method of "understanding" by SKELETUN. These
packets will consist of representations of semantic concepts.
Thus, if a stick figure drawing can be categorized as showing
SADNESS, or RUBBING-FACE, we will say that the system has
achieved a high level of understanding. (See Section 7 for more
examples of primary recognition packets.)

The secondary recognition packets will constitute a more
syntactic-based, lower level of understanding. These will
include ARCHED-TORSO, CHEST-THRUST-OUT, LEGS-STRADDLED,
SIDE-VIEW, FRONT-VIEW, and TOUCHING. The secondary packets are
components of recognition which by themselves are not adequate
concepts for "understanding" the stick figures. However, the
primary recognition packets may invoke or obtain results from the
secondary packets to aid in the recognition process.
5.2.  Parallelism in the System


Conceptually, SKELETUN will incorporate much parallelism in
its processing. Initial data, for example, need not invoke only
a single recognition packet; they may invoke several packets
"simultaneously." Each packet will independently try to verify
itself. Those which are unsuccessful will simply "die out." The
others will declare themselves successful. It is important,
however, that all recognition packets be able to communicate with
each other. This can be implemented by means of a global work
space. This concept is somewhat similar to the "blackboard" used
in the Hearsay system [LE1]. The work space is an area which is
accessible to all of the recognition packets. They will place
their results in the work space and obtain the results found by
other packets from the work space. In the example in Section 4,
the SIDE-VIEW, ARCHED-TORSO, and TOUCHING packets will pass their
results to the SADNESS and RUBBING-FACE packets via the work
space. If one recognition packet needs to invoke another one,
this can be done implicitly by placing a message in the work
space.

The concept of the work space allows all recognition packets
to be independent modules. Thus, if one wants to add a new
packet to the system, he need not know about all of the other
packets already in the system. Furthermore, the independence of
recognition packets implies that the system can be run on a
parallel machine, since there need be no higher level sequencing
component to determine which packet must be invoked when. The
notion of parallel recognition packets acting simultaneously is
also theoretically pleasing since the human mind seems to work
similarly.
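
A minimal sketch of the work space idea, in Python, is given
below. It is an illustration under assumptions, not the planned
LISP implementation: the class `WorkSpace`, the data keys
`hand_at_forehead` and `arm_extended`, and the one-line tests
inside each packet are invented. Each packet sees only the work
space, the packets are run on separate threads to stand in for
"simultaneous" invocation, and a packet that cannot verify
itself posts nothing and so "dies out."

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class WorkSpace:
    """Global area readable and writable by every recognition packet."""

    def __init__(self, data):
        self._lock = threading.Lock()
        self._items = dict(data)

    def post(self, key, value):
        with self._lock:
            self._items[key] = value

    def get(self, key, default=None):
        with self._lock:
            return self._items.get(key, default)


# Hypothetical packets: each consults only the work space, never
# another packet, and posts a result only if it verifies itself.
def saluting(ws):
    if ws.get("hand_at_forehead"):
        ws.post("SALUTING", "success")


def pointing(ws):
    if ws.get("arm_extended"):
        ws.post("POINTING", "success")


def looking(ws):
    if ws.get("hand_at_forehead"):
        ws.post("LOOKING", "success")


ws = WorkSpace({"hand_at_forehead": True, "arm_extended": False})
packets = [saluting, pointing, looking]

# The same data invoke several packets "simultaneously".
with ThreadPoolExecutor(max_workers=len(packets)) as pool:
    list(pool.map(lambda packet: packet(ws), packets))

survivors = [n for n in ("SALUTING", "POINTING", "LOOKING") if ws.get(n)]
```

Adding a fourth packet means writing one more function and
appending it to `packets`; none of the existing packets has to
change, which is the modularity argument made above.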


5.3.  Example  of  the  Control  Structure 


Consider the example in Section 4 in light of this parallel
organization. Display 11 shows the hierarchy consisting of data
at the lowest level, and secondary and primary recognition
packets at the next two levels. Display 12 is what a trace of
the program running on this example might look like. Either a
recognition packet or data may be in control. When a recognition
packet is in control, we say that the system is model-driven. As
the packet attempts to verify itself, it will make assertions
which are placed in the global work space. Recall that all
assertions in the work space are accessible to all packets. When
data are in control, we say that the system is data-driven. In
Display 12, an assertion made by data indicates which data are
invoking the models. The purpose of this display is to indicate
the control used for this example. Display 11, on the other
hand, indicates the organization of the system required for this
example.

The numbers above the boxes in Display 11 indicate the order
in which each recognition packet is invoked. Note that the three
secondary recognition packets are all invoked by the data
simultaneously. However, both ARCHED-TORSO and TOUCHING need to
know which view of the figure is being observed. They therefore
must wait for SIDE-VIEW to place this information in the work
space before they can proceed. SADNESS and RUBBING-FACE are also
invoked simultaneously, but only after TOUCHING and SIDE-VIEW
have asserted the following: (1) the 2-dimensional angles of the
elbows are almost the same as the 3-dimensional angles; and (2)
the hands are touching the face. By this time, ARCHED-TORSO has
already asserted that the torso of the figure is arched, and
therefore SADNESS reports success while RUBBING-FACE reports
failure.
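
The waiting behavior in this example can be sketched as a loop in
which a packet proceeds only once the assertions it needs are in
the work space. The Python below is a simplified illustration,
not the control structure of Displays 11 and 12: the function
`run_packets`, the assertion keys, and the rule that
RUBBING-FACE fails whenever the torso is arched are assumptions
made for the sketch. Packets that become ready together are run
as one round and all read the same work space contents.

```python
def run_packets(packets, workspace):
    """Run packets in rounds. A packet proceeds once every assertion
    it needs is in the work space. Returns the packets per round."""
    waiting = dict(packets)
    rounds = []
    while waiting:
        ready = [name for name, (needs, _) in waiting.items()
                 if all(key in workspace for key in needs)]
        if not ready:
            break  # the remaining packets can never proceed
        posted = {}
        for name in ready:
            _, act = waiting.pop(name)
            posted.update(act(workspace))
        workspace.update(posted)  # results become visible together
        rounds.append(ready)
    return rounds


# name -> (assertions needed, function returning new assertions)
packets = {
    "SIDE-VIEW": (
        ["feet"],
        lambda ws: {"view": "side", "elbow_angles_reliable": True},
    ),
    "ARCHED-TORSO": (
        ["view", "torso_shape"],
        lambda ws: {"torso_arched": ws["torso_shape"] == "curved"},
    ),
    "TOUCHING": (
        ["view", "hands_in_head_circle"],
        lambda ws: {"hands_touching_face": ws["hands_in_head_circle"]},
    ),
    "SADNESS": (
        ["elbow_angles_reliable", "hands_touching_face"],
        lambda ws: {"SADNESS": ws.get("torso_arched", False)},
    ),
    "RUBBING-FACE": (
        ["elbow_angles_reliable", "hands_touching_face"],
        lambda ws: {"RUBBING-FACE": not ws.get("torso_arched", False)},
    ),
}

# Hypothetical data for Display 1.
workspace = {
    "feet": "pointing-left",
    "torso_shape": "curved",
    "hands_in_head_circle": True,
}
rounds = run_packets(packets, workspace)
```

No sequencing component names the order explicitly; SIDE-VIEW
runs first only because the others are waiting on what it posts.
Changing `torso_shape` to a straight value reverses the outcome
of the two primary packets, as in the Display 2 discussion.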


5.4.  Body  Specialists 


Instead of using only the output of the preprocessor as data
(i.e., labels of body parts and projected angles of joints), I
may try the following. Each body part of the stick figure, i.e.,
right foot, left foot, right hand, left hand, right upper arm,
left upper arm, etc., can be represented by a specialist whose
purpose is to produce a symbolic representation of the position
of its body part in 3-dimensional space. In this manner, a foot
may be represented as "pointing left," "pointing right," or
"pointing forward" with respect to the viewer of the drawing.
This would provide the system with two levels of "data." The
first level would consist of the output of the preprocessor; the
second level would consist of the symbolic representation of the
positions of the body parts. In this way, the recognition
packets may look at either level of data.
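
A foot specialist of this kind might look like the Python sketch
below. The heuristic is an assumption made for illustration and
is not taken from the proposal: a foot drawn with little
horizontal extent is read as pointing toward the viewer, and
otherwise the sign of the horizontal offset decides left or
right. The function name `foot_specialist`, the threshold
`forward_ratio`, and the coordinate layout are hypothetical.

```python
import math


def foot_specialist(ankle, toe, forward_ratio=0.3):
    """Map a foot segment (ankle and toe points) to a symbolic
    description relative to the viewer of the drawing."""
    dx = toe[0] - ankle[0]
    dy = toe[1] - ankle[1]
    length = math.hypot(dx, dy)
    # Little horizontal extent: assume the foot points at the viewer.
    if length == 0 or abs(dx) / length < forward_ratio:
        return "pointing forward"
    return "pointing left" if dx < 0 else "pointing right"


# First level of data: output of the preprocessor (coordinates).
preprocessor_output = {
    "right_foot": {"ankle": (2.0, 0.0), "toe": (1.2, 0.1)},
}
# Second level of data: symbolic positions produced by specialists.
symbolic = {
    part: foot_specialist(p["ankle"], p["toe"])
    for part, p in preprocessor_output.items()
}
```

A recognition packet such as SIDE-VIEW could then consult
`symbolic` instead of recomputing directions from raw
coordinates, while packets that need exact angles can still read
the first level.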


5.5.  Notes on Implementation


SKELETUN will be implemented in LISP. It may incorporate
"demons" in its implementation. Conceptually, a demon's purpose
is to get "excited" when a particular set of events occurs. The
demon "watches" an arena of events, and when the set of events
for which it was created occurs, the demon executes a body of
instructions. In SKELETUN, demons may be used for body part
specialists or for invoking recognition packets. An interesting
implementation of demons in the form of "spontaneous
computations" has been done by Rieger [R2]. I may adapt this
implementation for use in SKELETUN.
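
The demon concept can be sketched in a few lines of Python. This
is a generic illustration of a demon watching an arena of
events; it is not Rieger's spontaneous computation mechanism and
not SKELETUN's code. The classes `Demon` and `Arena`, the event
names, and the choice to let a demon fire only once are
assumptions.

```python
class Demon:
    """Fires its body once the whole trigger set of events has occurred."""

    def __init__(self, trigger, body):
        self.trigger = set(trigger)
        self.body = body
        self.fired = False


class Arena:
    """The arena of events that the demons watch."""

    def __init__(self):
        self.events = set()
        self.demons = []

    def watch(self, demon):
        self.demons.append(demon)

    def announce(self, event):
        self.events.add(event)
        for demon in self.demons:
            # A demon gets "excited" when its trigger set is complete.
            if not demon.fired and demon.trigger <= self.events:
                demon.fired = True
                demon.body(self)


invoked = []
arena = Arena()
# Hypothetical demon that invokes the SADNESS recognition packet.
arena.watch(Demon({"side-view", "hands-touching-face"},
                  lambda a: invoked.append("SADNESS")))
arena.announce("side-view")            # demon keeps watching
arena.announce("hands-touching-face")  # trigger set complete: it fires
```

Used this way, a demon plays the role of the implicit invocation
described in Section 5.2: no packet calls SADNESS directly, yet
it is invoked as soon as the assertions it depends on appear.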
6.  Stages in the Research


The proposed research may be divided into three stages. The
first stage will involve constructing and implementing a theory
of understanding isolated stick figures. This is the initial
purpose of the research, and most of the ideas in this paper are
directed toward that end.

The second stage of this research will involve adding the
capability of understanding two or more stick figures which are
interacting. Consider the example provided earlier, consisting
of one figure standing with its arms straight up in the air and
another figure holding a gun pointed towards the first figure.
If one were to attempt to understand each figure individually, a
description of the scene might be: "One figure is stretching and
the other is pointing a gun." However, an understanding involving
the two figures as interacting humans might be: "A robbery is
occurring!" Indeed, the understanding of the scene changed when
the total context was taken into account.

Other examples of high level semantic inferences that can be
made about groups of stick figures are (1) "chorus line" instead
of "a group of dancers;" (2) "track meet" instead of "a bunch of
people running;" and (3) "soccer game" instead of "a bunch of
people running."

The third stage of this research will involve investigating
the understanding of comic strips which are purely pictorial and
consist only of stick-figure characters. This would probably use
many of the techniques used in natural language story
comprehension [R1], since a pictorial comic strip tells a story.
A picture-story understanding system will be able to infer all of
the cause-effect relationships which occur from one picture frame
of the comic strip to another. The system will display its
understanding of the comic strip by outputting an English summary
of the story line.
7.  The Data Base


The data base consists of drawings of stick figures by many
different people. Displays 1-10 show examples of drawings in the
data base. The first version of SKELETUN will have ten primary
recognition packets to be used for understanding. The names of
these packets, plus examples of stick figures whose gestures can
be categorized by these packets, are the following:

(1) weeping or sad - Display 1

(2) rubbing face - Display 2

(3) relaxed position - Display 3

(4) saluting - Display 4

(5) looking or seeking - Display 5

(6) pointing - Display 6

(7) showing off muscles - Display 7

(8) cupping ear to hear better - Display 8

(9) boxing stance - Display 9

(10) scratching own back - Display 10

Note that most of the recognition packets deal with static
gestures rather than actions, particularly vigorous actions. I
feel it is best to start with static gestures since they are
easier to understand.

The examples shown in Displays 1-10 are only part of the
whole data base presently available. This data base was recently
obtained from an introductory drawing class at the University of
Maryland. A brief description of each of the ten concepts was
given to the forty students in the class. Each student was then
asked to draw stick figures which they thought would instantiate
each concept.

Each drawing will be presented to SKELETUN. If the program
chooses one of the ten recognition packets, then we can say that
it has a good "understanding" of the drawing. The program may
choose more than one packet, in which case it will be saying:
"I'm not really sure what's going on in the drawing, but it looks
as if the stick figure is doing something which falls somewhere
within these packets." If it chooses no packet, this means that
SKELETUN cannot "understand" the drawing based on what is
presently in its memory. This is acceptable since people often
have the same problem, i.e., people quite often cannot really
understand what a stick figure is doing, since it might be doing
something which cannot be discerned, or it may be doing nothing
in particular. When this case arises, SKELETUN will try to
provide a concise description of the figure by using the
secondary recognition packets which were instantiated. For
example, SKELETUN might output: "This figure has its left leg up
in the air, its chest thrust out, and its hands clasped on top of
its head." This kind of description shows a partial understanding
which is based more on the syntax of the stick figure than on the
semantics. In the second version of SKELETUN, where drawings
will involve the interactions of two or more stick figures,
partial descriptions of each stick figure, when combined under a
higher level recognition packet, might result in a fuller
understanding of the scene.
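
The three outcomes just described (one packet chosen, several chosen,
none chosen) can be summarized in a few lines of Python. This is a
minimal sketch, not SKELETUN's actual output routine: the function
name report, its arguments, and the wording of the returned strings
are all assumptions made for illustration.

```python
def report(primary_chosen, secondary_assertions):
    """Summarize the outcome of presenting one drawing.

    primary_chosen: names of the primary packets that were chosen.
    secondary_assertions: phrases contributed by the secondary
    packets that were instantiated.
    """
    if len(primary_chosen) == 1:
        return f"understood: {primary_chosen[0]}"
    if len(primary_chosen) > 1:
        return "uncertain, somewhere within: " + ", ".join(primary_chosen)
    if secondary_assertions:
        # No primary packet was chosen: fall back to a description
        # built from the secondary packets (syntax, not semantics).
        return "This figure has " + ", ".join(secondary_assertions) + "."
    return "no description available"

print(report([], ["its left leg up in the air",
                  "its chest thrust out",
                  "its hands clasped on top of its head"]))
```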
8. Testing Methodology


The theory of understanding which I will construct, plus its
implementation through SKELETUN, will need to be tested to
determine its adequacy. In order to test the first version of
SKELETUN, I will obtain a new data base of drawings from both
amateurs and students of art. In addition, the data base will
contain drawings of stick figures derived from cartoons and
photographs. The drawings will be done by people unfamiliar with
the program. The same set of drawings will then be presented to
a group of subjects who also are unfamiliar with the program and
who have not seen the drawings before. The subjects will be
asked to concisely describe what each stick figure is doing.
These descriptions will be compared to the recognition packets or
brief descriptions chosen by SKELETUN. For each stick figure in
the data base, if the subjects choose a description which is
similar to one of the concepts represented by a recognition
packet in SKELETUN, then we would expect SKELETUN to choose that
same recognition packet.
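
One possible way to score this comparison is sketched below in
Python. The text does not specify a scoring measure, so everything
here is an assumption: that each subject's description has already
been mapped by hand to a concept name, that the subjects' majority
concept is the reference, and that the names majority_concept and
agreement_rate and the sample trials are hypothetical.

```python
from collections import Counter

def majority_concept(subject_labels):
    """Most common concept among the subjects' hand-mapped descriptions,
    or None if no subject description matched any concept."""
    if not subject_labels:
        return None
    return Counter(subject_labels).most_common(1)[0][0]

def agreement_rate(trials):
    """trials: list of (subject_labels, packets_chosen_by_program).
    A trial counts as agreement when the subjects' majority concept
    is among the packets the program chose."""
    if not trials:
        return 0.0
    hits = 0
    for subject_labels, chosen in trials:
        concept = majority_concept(subject_labels)
        if concept is not None and concept in chosen:
            hits += 1
    return hits / len(trials)

# Made-up illustration data, not experimental results.
trials = [
    (["saluting", "saluting", "looking or seeking"], ["saluting"]),
    (["pointing", "pointing"], ["boxing stance"]),
]
print(agreement_rate(trials))
```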
9. Conclusions


The domain of stick figures is rich enough to provide a good
environment in which to develop and test techniques and methods
of visual perception and understanding. Furthermore, the domain
of stick figures is interesting in its own right; that is, a
theory and computer system for stick figure understanding can be
applied to practical engineering problems of robot interaction
with humans. Finally, since the domain of stick figures has been
relatively unexplored, research in this area is overdue.


Acknowledgment


I would like to thank Prof. Cynthia Bickley for allowing me to
use her drawing class to obtain the data base of stick figures.
11. Appendix A


This appendix shows the hand-encoded input to SKELETUN of
the stick figure in Display 1. The coding is in LISP. The stick
figure contains 17 end points of the line segments, plus one
point for the center of the circle representing the head. The
end points are labelled from 1 to 17. The x and y coordinates of
each point are determined based on coordinate axes whose origin
and orientation is arbitrarily chosen. This procedure is
performed by a human rather than by a program. Note that each
end point shares line segments with at least one other point.


The circle is represented as follows:

( <x coord. of ctr.> <y coord. of ctr.> <radius> )

Each end point is represented as follows:

( <point no.> ( <x coord.> <y coord.> ) <a list
of the other points it is connected to> )
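
To make the encoding concrete, the Python sketch below mirrors the two
formats above and derives the line segments from the connection
lists. It is illustrative only: the three-point figure and the circle
are made-up data rather than the figure of Display 1, and the helper
names segments, one_way_links, and segment_length are hypothetical.

```python
import math

# Hypothetical data in the same shape as the LISP input.
circle = (5.0, 9.0, 1.0)   # (x of center, y of center, radius)
points = [                 # (point no., (x, y), [connected points])
    (1, (5.0, 0.0), [2]),
    (2, (5.0, 4.0), [1, 3]),
    (3, (5.0, 8.0), [2]),
]

def segments(points):
    """Collect each line segment once as an unordered pair of point numbers."""
    return {frozenset((n, m)) for n, _, links in points for m in links}

def one_way_links(points):
    """List (a, b) pairs where a names b but b does not name a,
    which would indicate a slip in the hand encoding."""
    links = {n: set(ls) for n, _, ls in points}
    return [(a, b) for a, ls in links.items() for b in sorted(ls)
            if a not in links.get(b, set())]

def segment_length(points, a, b):
    """Euclidean length of the segment between points a and b."""
    xy = {n: p for n, p, _ in points}
    return math.dist(xy[a], xy[b])

print(sorted(sorted(s) for s in segments(points)))
print(one_way_links(points))
print(segment_length(points, 1, 2))
```

Because every segment is recorded at both of its end points, a
consistency check such as one_way_links is a cheap guard against
errors in hand-encoded input.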


LISP Input

( 3 ( 14.0 3.7 ) ( 1 6 ))
( 4 ( 13.0 3.9 ) ( 2 5 ))
( 5 ( 10.9 21.4 ) ( 4 7 ))
( 6 ( 11.5 22.5 ) ( 3 7 ))
( 7 ( 14.3 36.0 ) ( 5 6 8 ))
( 8 ( 15.4 43.6 ) ( 7 11 ))
( 9 ( 2.7 51.5 ) ( 12 15 ))
( 10 ( 4.3 51.0 ) ( 12 16 ))
( 11 ( 15.1 52.0 ) ( 8 12 ))
( 12 ( 12.4 5c.0 ) ( 11 10
( 13 ( 6.9 65.5 ) ( 15 ))
( 14 ( 7.7 65.7 ) ( 16 ))
( 15 ( 6.6 63.2 ) ( 9 13 ))
( 16 ( 7.8 62.1 ) ( 10 14 ))
( 17 ( 11.0 61.2 ) ( 12 )))
))
IN CONTROL      ASSERTIONS

Data            (1) feet pointing left

SIDE-VIEW       (1) 3-D and 2-D elbow angles are almost the same
                (2) figure facing left part of page
                (3) face is pointing to left part of page

Data            (1) hands inside circle

TOUCHING        (1) hands are inside left part of circle
                (2) hands touching face

ARCHED-TORSO    (1) arched torso exists

SADNESS         success

RUBBING-FACE    fail

END


Note  that  all  assertions  by  models  are  placed  in  the 
global  work  space. 
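
The trace above can be imitated with a small global-work-space loop
in Python. This is a sequential sketch under loudly stated
assumptions, not the author's control structure: the trigger
conditions for each packet, the data item "torso segments bend", and
the assertion "hand moving across face" are invented for
illustration, and packets that are meant to run in parallel are
simply visited in turn.

```python
# Secondary packets: (name, assertions required, assertions posted).
# All trigger conditions here are illustrative guesses.
SECONDARY = [
    ("SIDE-VIEW", {"feet pointing left"},
     {"figure facing left part of page"}),
    ("TOUCHING", {"hands inside circle", "figure facing left part of page"},
     {"hands touching face"}),
    ("ARCHED-TORSO", {"torso segments bend"}, {"arched torso exists"}),
]
# Primary packets: (name, assertions required for success).
PRIMARY = [
    ("SADNESS", {"hands touching face", "arched torso exists"}),
    ("RUBBING-FACE", {"hands touching face", "hand moving across face"}),
]

def run(data):
    workspace = set(data)   # global work space shared by all packets
    trace, fired = [], set()
    changed = True
    while changed:          # repeat until no packet adds anything new
        changed = False
        for name, needs, posts in SECONDARY:
            if name not in fired and needs <= workspace:
                workspace |= posts
                fired.add(name)
                trace.append(name)
                changed = True
    results = {name: "success" if needs <= workspace else "fail"
               for name, needs in PRIMARY}
    return trace, results

trace, results = run(["feet pointing left", "hands inside circle",
                      "torso segments bend"])
print(trace)
print(results)
```

Placing every assertion in one shared set is what lets TOUCHING use a
fact posted by SIDE-VIEW without either packet knowing about the
other.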